Liver Disorders Dataset

In this article, we analyze the Liver Disorders Dataset from the UCI Machine Learning Repository.

Picture Source: niddk.nih.gov

Dataset Information:

The first five variables are all blood tests thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. Each line in the dataset constitutes the record of a single male individual.

Important note: The 7th field (selector) has been widely misinterpreted in the past as a dependent variable representing the presence or absence of a liver disorder. This is incorrect [1]. The 7th field was created by BUPA researchers as a train/test selector. It is not suitable as a dependent variable for classification. The dataset does not contain any variable representing the presence or absence of a liver disorder. Researchers who wish to use this dataset as a classification benchmark should follow the method used in experiments by the donor (Forsyth & Rada, 1986, Machine learning: applications in expert systems and information retrieval) and others (e.g. Turney, 1995, Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm), who used the 6th field (drinks), after dichotomizing, as a dependent variable for classification. Because of widespread misinterpretation in the past, researchers should take care to state their method clearly.
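Following the donor's approach, the drinks field can be turned into a binary classification target by dichotomizing it. A minimal sketch is below; the sample values and the cutoff of 3 half-pint equivalents are illustrative assumptions, since the exact threshold used in the cited experiments is not stated here.

```python
# Sketch: dichotomizing the "drinks" field into a binary label.
# Sample values and the cutoff of 3.0 are hypothetical.
drinks = [0.5, 2.0, 4.0, 6.0, 1.0, 3.0]

CUTOFF = 3.0
labels = [1 if d >= CUTOFF else 0 for d in drinks]
print(labels)  # [0, 0, 1, 1, 0, 1]
```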

Attribute Information:

MCV: mean corpuscular volume
AlkPhos: alkaline phosphatase
Sgpt: alanine aminotransferase
Sgot: aspartate aminotransferase
GammaGT: gamma-glutamyl transpeptidase
Drinks: number of half-pint equivalents of alcoholic beverages drunk per day
Selector: field used to split the data into two sets

Problem Description

In this article, the dependent variable is the number of drinks. Note that the Selector column is intended only to split the data into train and test subsets for one particular experiment; it carries no clinical meaning.

Features with high variance

Features measured on very different scales, with correspondingly different variances, can hurt the modeling process. For this reason, we standardize each feature by removing its mean and scaling it to unit variance.
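A minimal sketch of this z-score standardization, shown on one hypothetical feature column (the population standard deviation is used, matching scikit-learn's StandardScaler convention):

```python
# Z-score standardization: subtract the mean, divide by the standard deviation.
def standardize(values):
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divide by n, not n - 1).
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

scaled = standardize([10.0, 20.0, 30.0])
print(scaled)  # mean 0, unit variance: [-1.2247..., 0.0, 1.2247...]
```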

Train and Test sets
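Since the Selector field exists precisely to split the data, the split can be sketched as below. The row values are hypothetical, and the coding 1 = train, 2 = test is an assumption that should be verified against the actual file.

```python
# Sketch: splitting rows by the Selector field.
# Row values are hypothetical; the 1 = train / 2 = test coding is an assumption.
rows = [
    {"mcv": 85, "drinks": 0.5, "selector": 1},
    {"mcv": 90, "drinks": 4.0, "selector": 2},
    {"mcv": 88, "drinks": 2.0, "selector": 1},
]

train = [r for r in rows if r["selector"] == 1]
test = [r for r in rows if r["selector"] == 2]
print(len(train), len(test))  # 2 1
```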

Modeling: CatBoost Regressor

CatBoost is based on gradient-boosted decision trees. During training, a set of decision trees is built consecutively: each successive tree is fitted to reduce the loss of the ensemble built so far.
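The idea of consecutively fitting trees to reduce loss can be illustrated with a toy gradient-boosting loop for squared error, where each new decision stump is fitted to the residuals of the current ensemble. This is a conceptual sketch only, not CatBoost's actual algorithm (CatBoost adds ordered boosting, symmetric trees, and categorical-feature handling, among other things); the data and hyperparameters are made up.

```python
# Toy gradient boosting for regression with decision stumps.
def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in x:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        loss = (sum((r - lm) ** 2 for r in left)
                + sum((r - rm) ** 2 for r in right))
        if best is None or loss < best[0]:
            best = (loss, threshold, lm, rm)
    return best[1:]  # (threshold, left_value, right_value)

def boost(x, y, n_rounds=20, lr=0.5):
    pred = [sum(y) / len(y)] * len(y)  # start from the mean prediction
    for _ in range(n_rounds):
        # Each round fits a stump to the residuals, so the training loss
        # shrinks as trees accumulate.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lv, rv = fit_stump(x, residuals)
        pred = [pi + lr * (lv if xi <= t else rv) for xi, pi in zip(x, pred)]
    return pred

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.0, 3.0, 4.0]
pred = boost(x, y)
print(pred)  # predictions move toward y as rounds accumulate
```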

During training, we record the best result for each metric achieved on each validation dataset.

R2 Score
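The R² score (coefficient of determination) compares the model's squared error to that of a baseline that always predicts the mean of the targets: 1.0 is a perfect fit, 0.0 matches the mean baseline, and negative values are worse than it. A self-contained sketch:

```python
# Coefficient of determination: R^2 = 1 - SS_res / SS_tot.
def r2_score(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0 (perfect fit)
print(r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0 (mean baseline)
```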


References

  1. UCI Machine Learning Repository: Liver Disorders Data Set
  2. CatBoost Documentation